Co-Occurrence Patterns among Collocations: A Tool for Corpus-Based Lexical Knowledge Acquisition

Author

  • Douglas Biber
Abstract

One of the main problems for applied natural language processing is gaps in the lexicon, including missing words and word senses, and inadequate descriptions of word use in context. Traditional lexicography has similar concerns. The availability of large, on-line text corpora provides a straightforward tool for enlarging the stock of words included in a lexicon. The identification of additional word senses and uses is more problematic, however. Much recent lexicographic work employs concordances generated from text corpora for these purposes. While this approach provides a more solid empirical basis than traditional lexicographic approaches (which depend on the manual collection and sorting of citation index cards), concordances can actually provide too much data. For example, a concordance for the word certain produced on an 11.6 million-word subsample of the Longman/Lancaster Corpus generated 3,424 entries; a concordance for the word right from the same subcorpus generated 7,619 entries. Simply determining the number of different senses in a database of this size is a daunting task; to accurately group different uses or rank them in order of importance is not really feasible without the use of additional tools.

One such tool is to simply sort concordance lines according to their different collocational patterns. Entries can be sorted according to their collocates on both the left and the right. Many of these collocational pairs show a strong relation to a particular word sense (e.g., contrast right ear and right away), and thus analysis of collocational relations has become an important tool for lexical knowledge acquisition (see Sinclair 1991; Smadja 1991; Zernik 1991). In addition, there are statistical tools that can help determine the relative strength of collocational relations. For example, Church and Hanks (1990) describe the use of the mutual information index for this purpose (cf. Calzolari and Bindi 1990). Church et al. (1991) further describe the use of t-scores to assess the extent of the differences between the collocational patterns of nearly synonymous words. These tools are important in that the strongest collocational associations often represent different word senses, and thus 'they provide a powerful set of suggestions to the lexicographer for what needs to be accounted for in choosing a set of semantic tags' (Church and Hanks 1990, p. 28). However, such tools do not directly characterize word senses or even provide any direct indication of the number of different senses that a word has. Further, these...
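The two tools mentioned in the abstract, sorting concordance lines by collocate and scoring a pair with the mutual information index, can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names (`tokenize`, `concordance`, `sort_by_right_collocate`, `mutual_information`) and the sample sentence are hypothetical, and mutual information is estimated here from strictly adjacent word pairs, which is a simplifying assumption rather than the windowed counts a real study would likely use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (toy tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, span=1):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines


def sort_by_right_collocate(lines):
    """Group concordance lines by sorting on their right-hand context."""
    return sorted(lines, key=lambda line: line[2])


def mutual_information(tokens, x, y):
    """Estimate log2(P(x, y) / (P(x) * P(y))) for the adjacent pair 'x y'.

    Assumption: all probabilities are relative frequencies over the
    token count, and only immediately adjacent pairs are counted.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    f_xy = bigrams[(x, y)]
    if f_xy == 0:
        return float("-inf")  # the pair never co-occurs in this sample
    return math.log2(f_xy * n / (unigrams[x] * unigrams[y]))


if __name__ == "__main__":
    # Invented sample text, used only to exercise the functions.
    sample = "He left right away, and she left right away, but my right ear hurts."
    toks = tokenize(sample)
    for left, node, right in sort_by_right_collocate(concordance(toks, "right")):
        print(" ".join(left), f"[{node}]", " ".join(right))
    print(mutual_information(toks, "right", "away"))
```

On a real corpus the sorted lines would cluster uses such as right away and right ear next to each other, and the mutual information score would give a rough ranking of how strongly each collocate is tied to the node word. As the abstract notes, neither step by itself says how many senses the word has.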


Similar resources

Discourse Community Collocations and L2 Writing Content

Taking the position that writing can be an important skill to foster knowledge building pedagogy, this article explores vocabulary as a supportive tool for this purpose. Having this in mind, a compilation of conceptually loaded vocabularies pertaining to seven discourse communities was developed, two of which were given to a group of L2 writers to investigate the implications of phraseology for...


The Impact of Teaching Corpus-based Collocation on EFL Learners' Writing Ability

The present study explores the impact of corpus-based collocation instruction on intermediate Iranian EFL learners' writing ability. For this study, 84 Iranian learners studying English as a foreign language at Bayan Institute, Iran, were selected and randomly divided into two groups, experimental and control. Conventional methods of writing instruction were taught to the control...


Retrieving Collocations by Co-occurrences and Word Order Constraints

In this paper, we describe a method for automatically retrieving collocations from large text corpora. This method retrieves collocations in the following stages: 1) extracting strings of characters as units of collocations; 2) extracting recurrent combinations of strings in accordance with their word order in a corpus as collocations. Through the method, a wide range of collocations, especially...


Acquiring Collocations For Lexical Choice Between Near-Synonyms

We extend a lexical knowledge-base of near-synonym differences with knowledge about their collocational behaviour. This type of knowledge is useful in the process of lexical choice between near-synonyms. We acquire collocations for the near-synonyms of interest from a corpus (only collocations with the appropriate sense and part-of-speech). For each word that collocates with a near-synonym we us...



Journal title:
  • Computational Linguistics

Volume 19  Issue

Pages  -

Publication date 1993